 dependence plot



Causal Dependence Plots

Neural Information Processing Systems

To use artificial intelligence and machine learning models wisely we must understand how they interact with the world, including how they depend causally on data inputs. In this work we develop Causal Dependence Plots (CDPs) to visualize how a model's predicted outcome depends on changes in a given predictor along with consequent causal changes in other predictor variables. Crucially, this differs from standard methods based on independence or holding other predictors constant, such as regression coefficients or Partial Dependence Plots (PDPs).


SHAP Distance: An Explainability-Aware Metric for Evaluating the Semantic Fidelity of Synthetic Tabular Data

Yu, Ke, Ishikura, Shigeru, Usukura, Yukari, Shigoku, Yuki, Hayashi, Teruaki

arXiv.org Machine Learning

Synthetic tabular data, which are widely used in domains such as healthcare, enterprise operations, and customer analytics, are increasingly evaluated to ensure that they preserve both privacy and utility. While existing evaluation practices typically focus on distributional similarity (e.g., the Kullback-Leibler divergence) or predictive performance (e.g., Train-on-Synthetic-Test-on-Real (TSTR) accuracy), these approaches fail to assess semantic fidelity, that is, whether models trained on synthetic data follow reasoning patterns consistent with those trained on real data. To address this gap, we introduce the SHapley Additive exPlanations (SHAP) Distance, a novel explainability-aware metric that is defined as the cosine distance between the global SHAP attribution vectors derived from classifiers trained on real versus synthetic datasets. By analyzing datasets that span clinical health records with physiological features, enterprise invoice transactions with heterogeneous scales, and telecom churn logs with mixed categorical-numerical attributes, we demonstrate that the SHAP Distance reliably identifies semantic discrepancies that are overlooked by standard statistical and predictive measures. In particular, our results show that the SHAP Distance captures feature importance shifts and underrepresented tail effects that the Kullback-Leibler divergence and Train-on-Synthetic-Test-on-Real accuracy fail to detect. This study positions the SHAP Distance as a practical and discriminative tool for auditing the semantic fidelity of synthetic tabular data, and offers practical guidelines for integrating attribution-based evaluation into future benchmarking pipelines.
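
The metric described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-sample SHAP attributions have already been computed for each classifier, and it assumes the "global" vector is the mean absolute attribution per feature (the abstract does not state the aggregation). The function names `global_attribution` and `shap_distance` are hypothetical.

```python
import numpy as np


def global_attribution(shap_values):
    """Collapse per-sample attributions of shape (n_samples, n_features)
    into one global vector. Assumption: mean absolute value per feature."""
    return np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)


def shap_distance(shap_real, shap_synth):
    """Cosine distance between the two global attribution vectors.

    shap_real  : attributions from the classifier trained on real data
    shap_synth : attributions from the classifier trained on synthetic data
    Both must cover the same features in the same order.
    """
    g_real = global_attribution(shap_real)
    g_synth = global_attribution(shap_synth)
    denom = np.linalg.norm(g_real) * np.linalg.norm(g_synth)
    if denom == 0.0:
        raise ValueError("an attribution vector is all zeros")
    return 1.0 - float(g_real @ g_synth) / denom
```

Because cosine distance ignores overall scale, two classifiers that rank and weight features in the same proportions score close to 0 even if their raw attribution magnitudes differ.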


Investigating Application of Deep Neural Networks in Intrusion Detection System Design

Jeje, Mofe O.

arXiv.org Artificial Intelligence

Despite decades of development, existing intrusion detection systems (IDSs) still face challenges in improving detection accuracy, countering evasion, and detecting unknown attacks. To solve these problems, many researchers have focused on designing and developing IDSs that use Deep Neural Networks (DNNs), which provide advanced methods of threat investigation and detection. The motivation of this research is therefore to learn how effectively DNNs can detect and identify malicious network intrusions, while advancing their potential use in network intrusion detection. Using the ASNM-TUN dataset, the study applied a Multilayer Perceptron, a DNN modeling approach, to identify network intrusions and to distinguish among legitimate network traffic, direct network attacks, and obfuscated network attacks. To further enhance the speed and efficiency of this DNN solution, a feature selection technique called Forward Feature Selection (FFS) was implemented, which significantly reduced the feature subset. Test results for the Multilayer Perceptron model show no support for the model's ability to accurately distinguish the classes of network intrusion.
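
Forward feature selection is usually a greedy loop: start with no features and repeatedly add the one that most improves a score. The sketch below shows that generic procedure only; the abstract does not give the study's scoring criterion or stopping rule, so the `score` callback and the `min_gain` threshold are assumptions, and the function name is hypothetical.

```python
def forward_feature_selection(candidates, score, min_gain=0.0):
    """Greedy forward selection.

    candidates : iterable of feature names
    score      : callable taking a list of feature names and returning a
                 number where higher is better (for example, validation
                 accuracy of a model trained on those features)
    min_gain   : stop when the best addition improves the score by no
                 more than this amount (assumed stopping rule)
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current subset.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] - best_score <= min_gain:
            break  # no candidate helps enough; keep the current subset
        selected.append(best_feature)
        best_score = trial_scores[best_feature]
        remaining.remove(best_feature)
    return selected, best_score
```

In practice `score` would train and validate the classifier on each trial subset, which makes the loop expensive but keeps the final feature set small.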


Cybersecurity Assessment of Smart Grid Exposure Using a Machine Learning Based Approach

Jeje, Mofe O.

arXiv.org Artificial Intelligence

Disturbances to the stable and normal operation of power systems have grown phenomenally, particularly unauthorized access to confidential and critical data, injection of malicious software, and exploitation of security vulnerabilities in poorly patched software, among others. Developing assessment solutions with machine learning capabilities that keep pace in real time with the growth of these cyber-attacks is therefore not only critical to the security, reliability, and safe operation of power systems, but also germane to guaranteeing advanced monitoring and efficient threat detection. Using the Mississippi State University and Oak Ridge National Laboratory dataset, the study used an XGB Classifier modeling approach to diagnose and assess power system disturbances in terms of Attack Events, Natural Events, and No-Events. Test results show that the model generally performs well on all metrics across all three sub-datasets in accurately identifying and classifying the three types of power system events.


Mapping Walnut Water Stress with High Resolution Multispectral UAV Imagery and Machine Learning

Wang, Kaitlyn, Jin, Yufang

arXiv.org Artificial Intelligence

Effective monitoring of walnut water status and stress level across the whole orchard is an essential step towards precision irrigation management of walnuts, a significant crop in California. This study presents a machine learning approach using Random Forest (RF) models to map stem water potential (SWP) by integrating high-resolution multispectral remote sensing imagery from Unmanned Aerial Vehicle (UAV) flights with weather data. From 2017 to 2018, five flights of a UAV equipped with a seven-band multispectral camera were conducted over a commercial walnut orchard, paired with concurrent ground measurements of sampled walnut plants. The RF regression model, utilizing vegetation indices derived from orthomosaiced UAV imagery and weather data, effectively estimated ground-measured SWPs, achieving an $R^2$ of 0.63 and a mean absolute error (MAE) of 0.80 bars. The integration of weather data was particularly crucial for consolidating data across various flight dates. Significant variables for SWP estimation included wind speed and vegetation indices such as NDVI, NDRE, and PSRI. A reduced RF model excluding the red-edge indices NDRE and PSRI demonstrated slightly reduced accuracy ($R^2$ = 0.54). Additionally, the RF classification model predicted water stress levels in walnut trees with 85% accuracy, surpassing the 80% accuracy of the reduced classification model. The results affirm the efficacy of UAV-based multispectral imaging combined with machine learning, incorporating thermal data, NDVI, red-edge indices, and weather data, in walnut water stress estimation and assessment. This methodology offers a scalable, cost-effective tool for data-driven precision irrigation management at an individual plant level in walnut orchards.
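
Two of the vegetation indices named above are normalized band differences, which can be computed per pixel as follows. This is a small sketch using the standard definitions NDVI = (NIR - Red) / (NIR + Red) and NDRE = (NIR - RedEdge) / (NIR + RedEdge); the band inputs are assumed to be reflectance arrays, and the function names are hypothetical rather than taken from the study.

```python
import numpy as np


def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b) element-wise, with NaN where a + b == 0."""
    a = np.asarray(band_a, dtype=float)
    b = np.asarray(band_b, dtype=float)
    total = a + b
    out = np.full(total.shape, np.nan)
    # Only divide where the denominator is non-zero; other cells stay NaN.
    np.divide(a - b, total, out=out, where=total != 0)
    return out


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return normalized_difference(nir, red)


def ndre(nir, red_edge):
    """Normalized Difference Red Edge index."""
    return normalized_difference(nir, red_edge)
```

Index rasters like these, averaged over each tree crown, would then serve as input features to a regression or classification model alongside the weather variables.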


survex: an R package for explaining machine learning survival models

Spytek, Mikołaj, Krzyziński, Mateusz, Langbein, Sophie Hanna, Baniecki, Hubert, Wright, Marvin N., Biecek, Przemysław

arXiv.org Machine Learning

Summary: Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R package, which provides a cohesive framework for explaining any survival model by applying explainable artificial intelligence techniques. The capabilities of the proposed software encompass understanding and diagnosing survival models, which can lead to their improvement. By revealing insights into the decision-making process, such as variable effects and importances, survex enables the assessment of model reliability and the detection of biases. Thus, transparency and responsibility may be promoted in sensitive areas, such as biomedical research and healthcare applications.


Exploration of the Rashomon Set Assists Trustworthy Explanations for Medical Data

Kobylińska, Katarzyna, Krzyziński, Mateusz, Machowicz, Rafał, Adamek, Mariusz, Biecek, Przemysław

arXiv.org Machine Learning

The machine learning modeling process conventionally culminates in selecting a single model that maximizes a selected performance metric. However, this approach leads to abandoning a more profound analysis of slightly inferior models. Particularly in medical and healthcare studies, where the objective extends beyond predictions to valuable insight generation, relying solely on a single model can result in misleading or incomplete conclusions. This problem is particularly pertinent when dealing with a set of models known as the $\textit{Rashomon set}$, whose performance is close to the maximum. Such a set can be large and may contain models that describe the data in different ways, which calls for comprehensive analysis. This paper introduces a novel process to explore models in the Rashomon set, extending the conventional modeling approach. We propose the $\texttt{Rashomon_DETECT}$ algorithm to detect models with different behavior. It is based on recent developments in the eXplainable Artificial Intelligence (XAI) field. To quantify differences in variable effects among models, we introduce the Profile Disparity Index (PDI) based on measures from functional data analysis. To illustrate the effectiveness of our approach, we showcase its application in predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients, a foundational case study. Additionally, we benchmark our approach on other medical data sets, demonstrating its versatility and utility in various contexts. If differently behaving models are detected in the Rashomon set, their combined analysis leads to more trustworthy conclusions, which is of vital importance for high-stakes applications such as medical applications.
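
The idea of a Rashomon set, models whose performance is close to the best, can be illustrated with a simple filter. This sketch is not the $\texttt{Rashomon_DETECT}$ algorithm and does not compute the PDI; it only shows the set-membership rule, assuming a higher-is-better metric and a user-chosen tolerance `epsilon`. The function name and the example scores in the usage note are hypothetical.

```python
def rashomon_set(scores, epsilon):
    """Keep the models whose score is within `epsilon` of the best score.

    scores  : dict mapping model name to a performance value
              (assumed higher is better)
    epsilon : non-negative tolerance below the best score
    """
    if not scores:
        raise ValueError("scores must contain at least one model")
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    best = max(scores.values())
    return {name: s for name, s in scores.items() if s >= best - epsilon}
```

For example, with scores `{"forest": 0.90, "boosting": 0.89, "linear": 0.80}` and `epsilon=0.02`, the filter keeps `forest` and `boosting`, and those are the models whose explanations would then be compared.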


Causal Dependence Plots

Loftus, Joshua R., Bynum, Lucius E. J., Hansen, Sakina

arXiv.org Artificial Intelligence

Explaining artificial intelligence or machine learning models is increasingly important. To use such data-driven systems wisely we must understand how they interact with the world, including how they depend causally on data inputs. In this work we develop Causal Dependence Plots (CDPs) to visualize how one variable--an outcome--depends on changes in another variable--a predictor--$\textit{along with any consequent causal changes in other predictor variables}$. Crucially, CDPs differ from standard methods based on holding other predictors constant or assuming they are independent. CDPs make use of an auxiliary causal model because causal conclusions require causal assumptions. With simulations and real data experiments, we show CDPs can be combined in a modular way with methods for causal learning or sensitivity analysis. Since people often think causally about input-output dependence, CDPs can be powerful tools in the xAI or interpretable machine learning toolkit and contribute to applications like scientific machine learning and algorithmic fairness.
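
The contrast with holding other predictors constant can be seen in a toy example. The sketch below is an illustration under strong assumptions, not the authors' method or code: it assumes a known linear structural equation `x2 = 2 * x1 + u2` as the auxiliary causal model, and a fixed function `model` standing in for a fitted predictor. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Assumed auxiliary causal model: x1 causes x2.
x1 = rng.normal(size=n)
u2 = rng.normal(scale=0.1, size=n)  # exogenous noise for x2
x2 = 2.0 * x1 + u2


def model(x1, x2):
    """Stand-in for a fitted black-box predictor."""
    return x1 + x2


def partial_dependence(v):
    """PDP-style curve: set x1 = v and leave x2 at its observed values."""
    return model(np.full(n, v), x2).mean()


def causal_dependence(v):
    """CDP-style curve: set x1 = v and propagate the change to x2
    through the assumed structural equation, reusing each unit's noise."""
    x1_new = np.full(n, v)
    x2_new = 2.0 * x1_new + u2
    return model(x1_new, x2_new).mean()


grid = np.linspace(-1.0, 1.0, 5)
pdp = np.array([partial_dependence(v) for v in grid])
cdp = np.array([causal_dependence(v) for v in grid])

# Average slope of each curve over the grid.
pdp_slope = (pdp[-1] - pdp[0]) / (grid[-1] - grid[0])
cdp_slope = (cdp[-1] - cdp[0]) / (grid[-1] - grid[0])
```

In this setup the PDP-style curve has slope 1, reflecting only the direct use of `x1` by the model, while the CDP-style curve has slope 3 because the intervention on `x1` also shifts `x2`. A different causal model would give a different curve, which is why the causal assumptions matter.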